For S3D and 4D meta-analysis see (K. Kim et al. 2017).
In general, defining feature importance is a common task, though it is typically model dependent.
We take inspiration from the scree plot and try to apply it to the LDA-like approach. Consider a scree plot of the flea data. This shows which components contribute to the full sample, full dimensionality, \([n,p]\) variation of the data.
The user study task tries to explore the full sample, full dimensionality \([n,p]\) separation of two specified clusters. In an analogous manner, we create a scree-plot-like output to evaluate the contributions of the variables to the cluster separation (clSep) described in the data. This is related to what R. Fisher attempted in his 1936 paper on discriminant analysis. Similarly, we start by finding cluster means and covariances.
Cluster means:
| | tars1 | tars2 | head | aede1 | aede2 | aede3 |
|---|---|---|---|---|---|---|
| Concinna | 183.1 | 129.6 | 51.24 | 146.2 | 14.10 | 104.9 |
| Heptapot. | 138.2 | 125.1 | 51.59 | 138.3 | 10.09 | 106.6 |
| Heikert. | 201.0 | 119.3 | 48.87 | 124.7 | 14.29 | 81.0 |
Cluster variance-covariance matrices: for Concinna, Heptapot., and Heikert. respectively (matrices omitted).
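The per-cluster summaries above can be sketched as follows. This is a minimal example on synthetic stand-in data (the real flea data has 6 variables and 3 species; the sizes, locations, and scale here are illustrative assumptions, not the actual measurements):

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in for two flea species clusters, p = 3 variables.
a = rng.normal(loc=[183.0, 129.0, 51.0], scale=5.0, size=(21, 3))
b = rng.normal(loc=[138.0, 125.0, 51.5], scale=5.0, size=(22, 3))

# Per-cluster means and variance-covariance matrices.
mu_a, mu_b = a.mean(axis=0), b.mean(axis=0)
cov_a = np.cov(a, rowvar=False)
cov_b = np.cov(b, rowvar=False)

print(mu_a.round(1))
print(cov_a.round(2))
```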
Suppose the clusters in question are Concinna and Heptapot. The line between the cluster means of these groups is their difference. This is sufficient for Linear Discriminant Analysis, which assumes homogeneous variation between clusters. We instead start from Fisher's Discriminant Analysis, which accounts for within-cluster variance.
\[ clSep_{[1,p]} = (\mu_{b[1,p]} - \mu_{a[1,p]})^2~/~(\Sigma_{a[p,p]} + \Sigma_{b[p,p]})~~~;~a,~b~\text{are clusters} \in X_{[n,p]}\]
Then we alter the sum of the within-cluster covariances to its pooled equivalent (the weighted average of the two):
\[ clSep_{[1,p]} = (\mu_{b[1,p]} - \mu_{a[1,p]})^2~/~\left( (\Sigma_{a[p,p]} \cdot n_a + \Sigma_{b[p,p]} \cdot n_b) / (n_a + n_b) \right)~~~;~a,~b~\text{are clusters} \in X_{[n,p]}\]
| var | var_clSep | cumsum_clSep |
|---|---|---|
| tars1 | 0.36 | 0.36 |
| aede1 | 0.27 | 0.63 |
| tars2 | 0.18 | 0.81 |
| aede3 | 0.12 | 0.93 |
| aede2 | 0.05 | 0.98 |
| head | 0.02 | 1.00 |
We discard the sign, as we only care about the magnitude each variable contributes to the separation of the specified clusters. We scale the absolute terms by the inverse of their summation, so the contributions sum to 1. Now let's visualize this similarly to the scree plot.
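A minimal sketch of this computation, assuming the formula is applied per variable using the diagonal (variance) entries of the pooled covariances, on synthetic two-cluster data standing in for the flea measurements:

```python
import numpy as np

def clsep_contributions(a, b):
    """Per-variable cluster separation between clusters a and b,
    scaled so the contributions sum to 1."""
    n_a, n_b = len(a), len(b)
    mu_a, mu_b = a.mean(axis=0), b.mean(axis=0)
    # Pooled within-cluster variances (diagonals of the covariances).
    var_a = a.var(axis=0, ddof=1)
    var_b = b.var(axis=0, ddof=1)
    pooled = (var_a * n_a + var_b * n_b) / (n_a + n_b)
    clsep = (mu_b - mu_a) ** 2 / pooled
    # Squared differences are non-negative; scale by inverse of the sum.
    return clsep / clsep.sum()

rng = np.random.default_rng(1)
a = rng.normal([0.0, 0.0, 0.0], 1.0, size=(30, 3))
b = rng.normal([3.0, 1.0, 0.0], 1.0, size=(30, 3))
contrib = clsep_contributions(a, b)
print(contrib.round(2))  # the first variable should dominate
```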
However, we are concerned about the following case: the signal of one variable may be sufficient to explain the separation of the clusters, yet be dwarfed by the contribution of another variable. We review the correlation matrix to see whether this concern is valid, then we try to account for it by permuting (shuffling) the values in a single column and watching how the contributions change.
It seems our concerns are valid, at least in this case.
Permuting some variables can significantly impact the cluster separation explained by other variables, especially when they are correlated and a variable with a large contribution is permuted. We save the means of all permuted clSeps over the p permutations into a \([p \times p]\) matrix. We then take the mean of these means of single-variable permutations for a metric we call the mean, mean permuted cluster separation, or MMP clSep. Comparing with the original full sample clSep we find:
\[MMP~clSep_i = mean_i(mean_i(permuted~reps)) \]
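The permutation step can be sketched as below. Two details are assumptions, not taken from the source: each column is shuffled jointly across both clusters, and `reps=20` repetitions per variable stand in for whatever replication count the study actually used.

```python
import numpy as np

def clsep_contributions(a, b):
    n_a, n_b = len(a), len(b)
    d2 = (b.mean(axis=0) - a.mean(axis=0)) ** 2
    pooled = (a.var(axis=0, ddof=1) * n_a + b.var(axis=0, ddof=1) * n_b) / (n_a + n_b)
    c = d2 / pooled
    return c / c.sum()

def mmp_clsep(a, b, reps=20, seed=2):
    """Mean, mean permuted clSep: permute one column at a time,
    recompute the contributions, and average over everything."""
    rng = np.random.default_rng(seed)
    p = a.shape[1]
    perm_means = np.empty((p, p))  # row j: mean contributions with column j permuted
    for j in range(p):
        acc = np.zeros(p)
        for _ in range(reps):
            x = np.vstack([a, b])
            rng.shuffle(x[:, j])  # break variable j's signal (in place via view)
            pa, pb = x[: len(a)], x[len(a):]
            acc += clsep_contributions(pa, pb)
        perm_means[j] = acc / reps
    return perm_means.mean(axis=0)  # mean of the single-variable permutation means

rng = np.random.default_rng(3)
a = rng.normal([0, 0, 0], 1.0, size=(40, 3))
b = rng.normal([3, 1, 0], 1.0, size=(40, 3))
mmp = mmp_clsep(a, b)
print(mmp.round(2))
```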
We note that the variable head is a significant case. If we plot MMP clSep by itself and order the scree plot accordingly, we have:
Now that we have a measure, we want to define an objective cutoff for evaluation. We want the measure to have a few attributes:
Following these, we define the measure as: \[diff_i = MMP~clSep_i - (1 / p)\] \[marks = \sum_{i=1}^{p} I(response_i) * sgn(diff_i) * \sqrt{|diff_i|}\]
Here, we add lines indicating the weight of each variable if selected as important. We then apply our measure to evaluate task responses; we review an example response below:
| var | data_colnum | MMP_clSep | cumsum_MMP_clSep | exampleResponse | diff | weight | marks |
|---|---|---|---|---|---|---|---|
| tars1 | 1 | 0.32 | 0.32 | 1 | 0.15 | 0.39 | 0.39 |
| aede1 | 4 | 0.25 | 0.57 | 1 | 0.08 | 0.29 | 0.29 |
| tars2 | 2 | 0.15 | 0.72 | 0 | -0.02 | -0.13 | 0.00 |
| aede3 | 6 | 0.12 | 0.83 | 1 | -0.05 | -0.23 | -0.23 |
| head | 3 | 0.09 | 0.92 | 0 | -0.07 | -0.27 | 0.00 |
| aede2 | 5 | 0.03 | 0.96 | 1 | -0.13 | -0.37 | -0.37 |
Total marks = 0.08
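The scoring in the table above can be sketched directly from the formula, assuming the cutoff is the uniform contribution 1/p with p = 6. Because the MMP clSep values shown are rounded, the total here lands near, not exactly at, the printed 0.08:

```python
import numpy as np

# Rounded MMP clSep values and the example response from the table.
var = ["tars1", "aede1", "tars2", "aede3", "head", "aede2"]
mmp = np.array([0.32, 0.25, 0.15, 0.12, 0.09, 0.03])
response = np.array([1, 1, 0, 1, 0, 1])  # 1 = selected as important

p = len(mmp)
diff = mmp - 1.0 / p                       # distance from the uniform cutoff
weight = np.sign(diff) * np.sqrt(np.abs(diff))
marks = (response * weight).sum()          # only selected variables score
print(weight.round(2))
print(round(marks, 2))
```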
All linear projections are necessarily lossy representations of the full data. By this we mean that no single 2D frame can show the whole set of information for a \(p \geq 3\)-dimensional sample. Any pair of principal components necessarily shows less than all the variation, namely the sum of their contributions, typically stated as a percentage of full sample variation. Analogously, any single projection cannot show the full information explaining the cluster separation of two given clusters.
In application, a PC1-by-PC2 biplot of the flea data contains 81.1 percent of the variation explained in the sample, while viewing (an orthogonal projection of) the top two variables (namely tars1 and aede1) explains 63.02 percent of the within-sample cluster separation between Concinna and Heptapot.
In order to stress test this cluster separation viewed by a scree plot, we apply it to other toy datasets.
### Penguins, between levels of sex with 1 species

(invalid assumptions, as there are 3 species clusters for each sex)
Can we simulate the cluster separation that we expect? Let's create a simulation that has variable contributions for the following cases:
Observe how changing the variance-covariances changes cluster separation, given that cluster means differ as 80, 20, rep(0) (signal from the means is large relative to the variance).
Each cluster’s covariance:
| V1 | V2 | V3 | V4 | V5 | |
|---|---|---|---|---|---|
| V1 | 1.0 | 0.3 | 0 | 0 | 0 |
| V2 | 0.3 | 1.0 | 0 | 0 | 0 |
| V3 | 0.0 | 0.0 | 1 | 0 | 0 |
| V4 | 0.0 | 0.0 | 0 | 1 | 0 |
| V5 | 0.0 | 0.0 | 0 | 0 | 1 |
Cluster ‘a’ covariance:
| V1 | V2 | V3 | V4 | V5 | |
|---|---|---|---|---|---|
| V1 | 5.0 | 0.7 | 0.7 | 0.7 | 0.7 |
| V2 | 0.7 | 5.0 | 0.7 | 0.7 | 0.7 |
| V3 | 0.7 | 0.7 | 5.0 | 0.7 | 0.7 |
| V4 | 0.7 | 0.7 | 0.7 | 5.0 | 0.7 |
| V5 | 0.7 | 0.7 | 0.7 | 0.7 | 5.0 |
Cluster ‘b’ covariance:
| V1 | V2 | V3 | V4 | V5 | |
|---|---|---|---|---|---|
| V1 | 1 | 0 | 0 | 0 | 0 |
| V2 | 0 | 1 | 0 | 0 | 0 |
| V3 | 0 | 0 | 1 | 0 | 0 |
| V4 | 0 | 0 | 0 | 1 | 0 |
| V5 | 0 | 0 | 0 | 0 | 1 |
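The simulation can be sketched as below, assuming the covariance settings above are applied one at a time; this sketch pairs cluster 'a''s 5.0/0.7 structure against cluster 'b''s identity covariance, with the stated mean difference of (80, 20, 0, 0, 0), and the sample size per cluster is an illustrative assumption:

```python
import numpy as np

rng = np.random.default_rng(4)
p, n = 5, 200

# Mean difference (80, 20, 0, 0, 0): signal from the means is large
# relative to the variance.
mu_a = np.zeros(p)
mu_b = np.array([80.0, 20.0, 0.0, 0.0, 0.0])

# Cluster 'a': variance 5.0 on the diagonal, covariance 0.7 elsewhere.
cov_a = np.full((p, p), 0.7)
np.fill_diagonal(cov_a, 5.0)
# Cluster 'b': identity covariance.
cov_b = np.eye(p)

a = rng.multivariate_normal(mu_a, cov_a, size=n)
b = rng.multivariate_normal(mu_b, cov_b, size=n)

# Per-variable pooled clSep, scaled to sum to 1.
d2 = (b.mean(axis=0) - a.mean(axis=0)) ** 2
pooled = (a.var(axis=0, ddof=1) * n + b.var(axis=0, ddof=1) * n) / (2 * n)
clsep = d2 / pooled
contrib = clsep / clsep.sum()
print(contrib.round(3))  # V1 dominates, V2 carries most of the rest
```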
In order to properly distinguish a difference between the 3 visualization factors, the data must be of suitable complexity, such that it has the following properties:
Let's try to evaluate our current generation of data simulations against these properties.
This was a 300 series simulation done at the end of the generation 1 user study shiny app.
This seems sufficiently complex not to be seen as a pair of components within the first 4 principal components. Now to see whether we can see anything in radial tours of all variables. We view clSep to explore which variables should contain contributions.
Fisher, Ronald A. “The Use of Multiple Measurements in Taxonomic Problems.” Annals of Eugenics 7, no. 2 (September 1936): 179-88. https://doi.org/10.1111/j.1469-1809.1936.tb02137.x.